Effective Information Integration from Disparate Microarray Datasets
Abstract
Although most genes in human DNA have been sequenced, the functional relationships between them are not fully understood. Determining which genes are linked to a specific type of cancer, and how those genes are interrelated, is an important problem. With the development of DNA microarrays and other biological devices, the expression levels of thousands of genes can be measured at the same time. Because of these technological advances, data is produced at a very fast rate, so analyzing and integrating it efficiently is paramount. However, extracting biologically relevant information is a challenging task, and the small number of patients is a limiting factor.
Information was integrated from studies performed at different institutions using various microarray technologies. All four datasets from CAMDA 2003 were initially considered, but only the Harvard and Michigan datasets were chosen, giving 289 patients: 203 from Harvard and 86 from Michigan. The two datasets used different Affymetrix oligonucleotide microarrays (HU6800 and HGU95a chips) and therefore had to be integrated. Bioconductor packages implemented in R were used to obtain gene expression values from the CEL files. The datasets were RMA normalized, and inactive genes with a small standard deviation were eliminated. The comparison spreadsheet obtained from the Affymetrix website was then used to integrate the resulting datasets, yielding 1,709 common probesets for 289 patients. The integrated dataset was renormalized with RMA to eliminate any experimental bias caused by different environments and technologies.
Many clustering methods have been applied to microarray datasets, including k-means clustering, hierarchical clustering, and self-organizing maps (SOM). However, most of these algorithms cluster along only one dimension. Typically, microarray data is arranged as a matrix whose rows are probesets and whose columns are patients.
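The filtering and probeset-matching steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' R/Bioconductor pipeline: the probeset IDs, the `min_sd` threshold, and the helper names `filter_low_variance` and `integrate` are all hypothetical, RMA normalization is omitted, and the comparison spreadsheet is modeled as a plain dictionary from one chip's probeset IDs to the other's.

```python
from statistics import stdev

def filter_low_variance(expr, min_sd):
    """Keep probesets whose sample standard deviation is at least min_sd."""
    return {probe: values for probe, values in expr.items()
            if stdev(values) >= min_sd}

def integrate(expr_a, expr_b, probe_map):
    """Join two datasets on matched probesets, concatenating patient columns.

    probe_map maps a probeset ID on chip A to its counterpart on chip B
    (a stand-in for the vendor comparison spreadsheet).
    """
    merged = {}
    for probe_a, probe_b in probe_map.items():
        if probe_a in expr_a and probe_b in expr_b:
            merged[probe_a] = expr_a[probe_a] + expr_b[probe_b]
    return merged

# Hypothetical toy data: 3 probesets per chip, 3 and 2 patients respectively.
chip_a = {"A1": [2.0, 8.0, 5.0], "A2": [4.0, 4.1, 4.0], "A3": [1.0, 9.0, 3.0]}
chip_b = {"B1": [3.0, 7.0], "B2": [2.0, 9.0], "B3": [5.0, 5.0]}
probe_map = {"A1": "B1", "A2": "B2", "A3": "B3"}

# The 0.5 threshold is an arbitrary illustrative value.
active_a = filter_low_variance(chip_a, min_sd=0.5)
active_b = filter_low_variance(chip_b, min_sd=0.5)
integrated = integrate(active_a, active_b, probe_map)
```

Only probesets that survive filtering on both chips and appear in the mapping reach the integrated matrix, which is why the common set is much smaller than either chip's full probeset list.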
Thus, one would like a system that can cluster both dimensions of a matrix simultaneously by exploiting both the rows and the columns. Co-clustering differs from clustering along one dimension in that, at all stages, row clusters incorporate column cluster information and vice versa. We applied minimum sum-squared residue co-clustering to cluster rows (probesets) and columns (patients) simultaneously (Cho et al., 2004). We picked 25 as the number of probeset clusters because it resulted in reasonably sized clusters. Having too many probesets inside a cluster leads to a large number of accidental relations and decreases the reliability and interpretability of the final probeset clusters. The number of patient clusters was chosen based on the different types of diseases in each dataset (5 column clusters for Harvard and 2 for Michigan). The final probeset clusters obtained from the co-clustering algorithm formed the basis for our analysis.
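As a rough illustration of the objective that co-clustering minimizes, the sketch below computes a sum-squared residue for a fixed assignment of rows and columns to clusters. It assumes the simplest residue definition, the difference between each entry and the mean of its co-cluster block; the residue actually used in the study may differ, and the alternating search over assignments is not shown. The function names and the toy matrix are hypothetical.

```python
from collections import defaultdict

def cocluster_means(matrix, row_labels, col_labels):
    """Mean of every (row cluster, column cluster) block."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def sum_squared_residue(matrix, row_labels, col_labels):
    """Sum over all entries of (entry - mean of its co-cluster block)^2."""
    means = cocluster_means(matrix, row_labels, col_labels)
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            residue = value - means[(row_labels[i], col_labels[j])]
            total += residue * residue
    return total

# Toy matrix: 4 probesets (rows) x 4 patients (columns) with a clear block structure.
toy = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [9.0, 9.0, 3.0, 3.0],
    [9.0, 9.0, 3.0, 3.0],
]

# Row labels that match the blocks versus row labels that mix them.
good = sum_squared_residue(toy, [0, 0, 1, 1], [0, 0, 1, 1])
bad = sum_squared_residue(toy, [0, 1, 0, 1], [0, 0, 1, 1])
```

Because each block mean depends on both the row and the column assignment, changing either set of labels changes the objective, which is the sense in which row clusters incorporate column cluster information and vice versa.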
Similar resources
Integration and Reduction of Microarray Gene Expressions Using an Information Theory Approach
The DNA microarray is an important technique that allows researchers to analyze many gene expression data in parallel. Although the data can be more significant if they come out of separate experiments, one of the most challenging phases in the microarray context is the integration of separate expression level datasets that have gathered through different techniques. In this paper, we prese...
A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number of instances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improve the classification performance of microarray datasets by selecting the significant features. Combining the concepts of rough sets, weighted rough set, fuzzy rough se...
On integrating multi-experiment microarray data.
With the extensive use of microarray technology as a potential prognostic and diagnostic tool, the comparison and reproducibility of results obtained from the use of different platforms is of interest. The integration of those datasets can yield more informative results corresponding to numerous datasets and microarray platforms. We developed a novel integration technique for microarray gene-ex...
Integration of pre-normalized microarray data using quantile correction
An enormous amount of microarray data has been collected and accumulated in public repositories. Although some of the depositions include raw and processed data, significant parts of them include processed data only. If we need to combine multiple datasets for specific purposes, the data should be adjusted prior to use to remove bias between the datasets. We focused on a GeneChip platform and a...
SFLA Based Gene Selection Approach for Improving Cancer Classification Accuracy
In this paper, we propose a new gene selection algorithm based on Shuffled Frog Leaping Algorithm that is called SFLA-FS. The proposed algorithm is used for improving cancer classification accuracy. Most of the biological datasets such as cancer datasets have a large number of genes and few samples. However, most of these genes are not usable in some tasks for example in cancer classification....